Using Overload Prevention Mechanisms

Learn mechanisms to prevent overwhelming networks and services.

When we have a small set of services, misbehaving applications generally cause small problems. This is because there is usually an overabundance of network capacity to absorb badly behaving applications within a data center, and with a small set of services, it is usually intuitive to figure out what would cause the issue.

When we have a large number of applications running, our network and our machines are usually oversubscribed, meaning they cannot handle every application running at 100% simultaneously. Oversubscription is common in networks and clusters as a way to control costs. It works because, at any given time, most applications' demands on the network, central processing unit (CPU), and memory ebb and flow.


An application that suddenly experiences some type of bug can go into retry loops that quickly overwhelm a service. In addition, if some catastrophic event takes a service offline, trying to bring the application back online can cause the service to go down again as it is overwhelmed by the requests that have been queuing on all clients.

Worse is what can happen to the network. If the network becomes overwhelmed, or if cloud services have their queries-per-second (QPS) limits exceeded, other applications can have their traffic adversely affected. This can mask the true cause of our problems.

There are several ways of preventing these types of problems, with the two most common being the following:

  • Circuit breakers

  • Backoff implementations

Each of these prevention mechanisms has the same idea: when failures occur, prevent retries from overwhelming the service.

Infrastructure services are often an overlooked use case for these prevention mechanisms. Many times, we concentrate on our public services, but infrastructure services are just as important. If that service is critical and becomes overwhelmed, it can be difficult to restore it without manually touching other services to reduce load.

Let's have a look at one of the more popular methods: the circuit breaker.

AWS had an outage that affected AWS customers across the world when a misbehaving application began sending too much traffic across a network boundary between their customer network and their core network where AWS critical services live. This was restricted to their us-east-1 region, but the effects were felt by their customers in multiple locations. The problem was twofold, comprising the following factors:

  • A misbehaving application sending too many requests.
  • Their clients didn’t back off on failure.

It is the second issue that caused the prolonged failure. AWS had been doing the right thing in having a standard client for RPCs that invoked increasing backoffs when requests failed. However, for some reason, the client library did not perform as expected in this case.

This means that instead of the load reducing itself as the endpoints became overwhelmed, the clients went into some type of infinite loop that kept increasing the load on the affected systems and overwhelmed their network cross-connects. This overwhelming of cross-connects disabled their monitoring and prevented them from seeing the problem. The result was that they had to try to reduce their network load by scaling back application traffic while trying not to affect the customer services that were still working, a feat I would not envy.

This case points to how important it is to prevent application retries when failures occur. To read more on this from Amazon, see the following web page.

Using circuit breakers#

Circuit breakers work by wrapping RPC calls within a client that will automatically fail any attempt once a threshold is reached. All calls then simply return a failure without actually making any attempt for some amount of time.

Circuit breakers have three modes, as follows:

  • Closed

  • Open

  • Half-open

A circuit breaker is in a closed state when everything is working. This is the normal state.

A circuit breaker is in an open state after some number of failures trips the breaker. When in this state, all requests are automatically failed without any attempt to send the message. This period lasts for some amount of time. It is suggested that this time be some set period plus some randomness (jitter) to prevent clients from spontaneously synchronizing their retries.

A circuit breaker moves into a half-open state after some time in the open state. Once in the half-open state, some number of incoming requests are actually attempted. If some threshold of successes is passed, the circuit breaker moves back into the closed state. If not, the circuit breaker moves back into the open state.


We can find several different circuit-breaker implementations for Go, but one of the most popular was developed at Sony, called gobreaker.

Let's look at how we might use it to limit retries for HTTP queries, as follows:

main.go: A demo circuit breaker to limit retries for HTTP queries

The preceding code defines the following:

  • Lines 45–48: An HTTP type that holds both of these:

    • An http.Client for making HTTP requests.

    • A circuit breaker for HTTP requests.

  • Lines 50–75: A New() constructor for our HTTP type. It creates a circuit breaker with settings that enforce the following:

    • Allows one request at a time when in the half-open state.

    • Has a 30-second timeout for the open state, after which the breaker moves to half-open.

    • Has a 10-second interval in the closed state, after which failure counts are cleared.

    • Trips into the open state if we have five consecutive failures.

  • Lines 77–98: A Get() method on HTTP that does the following:

    • Checks that the *http.Request has a timeout defined.

    • Calls the circuit breaker on our client.Do() method.

    • Converts the returned interface{} to the underlying *http.Response.

This code gives us a robust HTTP client wrapped with a circuit breaker. A better version of this might pass in the settings to the constructor, but we wanted it to be packed neatly for the example.

Using backoff implementations#

A backoff implementation wraps RPCs with a client that will retry with a pause between attempts. These pauses get longer and longer until they reach some maximum value.

Backoff implementations can have a wide range of methods for calculating the time period. We will concentrate on exponential backoff in this section.

Exponential backoff simply adds a delay to each attempt that increases exponentially as failures mount. As with circuit breakers, there are many packages offering backoff implementations.

For this example, we will use https://pkg.go.dev/github.com/cenk/backoff, which is an implementation of Google's HTTP backoff library for Java.

This backoff implementation offers many important features that Google has found useful over years of studying service failures. One of the most important features in the library is adding random values to sleep times between retries. This prevents multiple clients from syncing their retry attempts.

Other important features include the ability to honor context cancellations and supply maximum retry attempts.

Let's look at how we might use it to limit retries for HTTP queries, as follows:

main.go: A demo of backoff to limit retries

The preceding code defines the following:

  • Lines 46–48: An HTTP type that holds both of these:

    • An http.Client for making HTTP requests.

    • An exponential backoff for HTTP requests.

  • Lines 50–54: A New() constructor for our HTTP type.

  • Lines 56–86: A Get() method on HTTP that does the following:

    • Creates a func() error, called op, that attempts our request.

    • Runs op with retries and exponential delays.

    • Creates an exponential backoff with default values.

    • Wraps that backoff in BackOffContext to honor our context deadline.

For a list of the default values for ExponentialBackoff, see this.

Combining circuit breakers with backoff#

When choosing a prevention implementation, another option is to combine a circuit breaker with backoff for a more robust implementation.

A backoff implementation can be set with a maximum time during which retries occur. Wrapping that inside a circuit breaker lets any set of failed attempts trip the breaker: the backoff reduces our load by slowing our requests, and the circuit breaker can stop the attempts entirely.

The code below is an implementation combining both a circuit breaker and backoff.

main.go: Combining backoff and circuit breaker

In this lesson, we have discussed the need to have mechanisms to prevent overwhelming our network and services. We have discussed an AWS outage that was partially due to the failure of such mechanisms. We were introduced to the circuit-breaker and backoff mechanisms to prevent these types of failures. Finally, we have shown two popular packages for implementing these mechanisms with examples.
